import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
user_df = pd.read_csv('USER_TAKEHOME.csv')
user_df.shape
user_df
| | ID | CREATED_DATE | BIRTH_DATE | STATE | LANGUAGE | GENDER |
|---|---|---|---|---|---|---|
| 0 | 5ef3b4f17053ab141787697d | 2020-06-24 20:17:54.000 Z | 2000-08-11 00:00:00.000 Z | CA | es-419 | female |
| 1 | 5ff220d383fcfc12622b96bc | 2021-01-03 19:53:55.000 Z | 2001-09-24 04:00:00.000 Z | PA | en | female |
| 2 | 6477950aa55bb77a0e27ee10 | 2023-05-31 18:42:18.000 Z | 1994-10-28 00:00:00.000 Z | FL | es-419 | female |
| 3 | 658a306e99b40f103b63ccf8 | 2023-12-26 01:46:22.000 Z | NaN | NC | en | NaN |
| 4 | 653cf5d6a225ea102b7ecdc2 | 2023-10-28 11:51:50.000 Z | 1972-03-19 00:00:00.000 Z | PA | en | female |
| ... | ... | ... | ... | ... | ... | ... |
| 99995 | 61fc06d41febf771966da8fa | 2022-02-03 16:46:12.000 Z | 1992-03-16 08:00:00.000 Z | CA | en | female |
| 99996 | 6391e7ef90ad5449ec5f782d | 2022-12-08 13:34:39.000 Z | 1993-09-23 05:00:00.000 Z | MO | en | female |
| 99997 | 637d5efdd6f2a49c49934dcb | 2022-11-22 23:45:05.000 Z | 1983-04-19 00:00:00.000 Z | RI | en | female |
| 99998 | 5f0de23b05d8a6147dc0cafa | 2020-07-14 16:50:04.000 Z | 1995-06-09 04:00:00.000 Z | DE | en | female |
| 99999 | 5e7ab436905e9512ff44abc8 | 2020-03-25 01:30:30.000 Z | 1995-12-15 05:00:00.000 Z | VA | NaN | female |
100000 rows × 6 columns
#Familiarizing with Dataset
user_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   ID            100000 non-null  object
 1   CREATED_DATE  100000 non-null  object
 2   BIRTH_DATE    96325 non-null   object
 3   STATE         95188 non-null   object
 4   LANGUAGE      69492 non-null   object
 5   GENDER        94108 non-null   object
dtypes: object(6)
memory usage: 4.6+ MB
#Checking for Null Values
user_df.isnull().sum()
ID                  0
CREATED_DATE        0
BIRTH_DATE       3675
STATE            4812
LANGUAGE        30508
GENDER           5892
dtype: int64
# Calculate the percentage of null values for each column
null_percentage = (user_df.isnull().sum() / len(user_df)) * 100
#Percentage of Blank Values
print(null_percentage)
ID               0.000
CREATED_DATE     0.000
BIRTH_DATE       3.675
STATE            4.812
LANGUAGE        30.508
GENDER           5.892
dtype: float64
Based on the exploration above, roughly 30% of the values in the LANGUAGE column are blank/null. This is a serious data quality issue: language is an important demographic attribute, and such a large gap hinders understanding of user demographics.
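One way to keep that 30% gap visible rather than silently dropping it is to treat missing values as their own category when reporting shares. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical frame standing in for user_df; only LANGUAGE matters here
df = pd.DataFrame({'LANGUAGE': ['en', 'es-419', None, 'en', None]})

# Keep NaN as an explicit 'missing' bucket so it shows up in the shares
lang_share = (df['LANGUAGE']
              .fillna('missing')
              .value_counts(normalize=True) * 100)
print(lang_share.round(0))
```

This way any downstream chart or table reports "missing" alongside the real languages instead of quietly rescaling to the non-null subset.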
#Checking for any duplicated rows, as a user should be unique
user_df.duplicated().sum()
0
None of the rows are duplicated in this dataset.
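Note that `duplicated()` on the full frame only catches rows that are identical in every column; a duplicated ID whose other fields differ would slip through. A small sketch on hypothetical data shows the difference, so checking the key column directly is worth doing too:

```python
import pandas as pd

# Hypothetical frame: 'a1' appears twice, but the rows are not identical
df = pd.DataFrame({'ID': ['a1', 'b2', 'a1'],
                   'STATE': ['CA', 'PA', 'NY']})

full_row_dupes = df.duplicated().sum()        # no fully identical rows
id_dupes = df['ID'].duplicated().sum()        # but the ID repeats
print(full_row_dupes, id_dupes)
```

In this dataset, `user_df['ID'].duplicated().sum()` would confirm that IDs themselves are unique.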
#Checking the unique values of State Column
user_df['STATE'].unique()
array(['CA', 'PA', 'FL', 'NC', 'NY', 'IN', nan, 'OH', 'TX', 'NM', 'PR',
'CO', 'AZ', 'RI', 'MO', 'NJ', 'MA', 'TN', 'LA', 'NH', 'WI', 'IA',
'GA', 'VA', 'DC', 'KY', 'SC', 'MN', 'WV', 'DE', 'MI', 'IL', 'MS',
'WA', 'KS', 'CT', 'OR', 'UT', 'MD', 'OK', 'NE', 'NV', 'AL', 'AK',
'AR', 'HI', 'ME', 'ND', 'ID', 'WY', 'MT', 'SD', 'VT'], dtype=object)
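Beyond eyeballing the array, the observed codes can be validated against a reference set; the 53 unique entries above correspond to the 50 states plus DC and PR, plus `nan`. A sketch of that check, with a hypothetical bad code to show what it catches:

```python
import pandas as pd

# Reference set: 50 US states plus DC and PR, matching the codes seen above
valid = {
    'AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL','IN',
    'IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV',
    'NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN',
    'TX','UT','VT','VA','WA','WV','WI','WY','DC','PR'
}

# Hypothetical column including an invalid code 'XX' and a null
observed = pd.Series(['CA', 'PR', 'XX', None]).dropna()
unexpected = set(observed) - valid
print(unexpected)
```

Running `set(user_df['STATE'].dropna()) - valid` on the real data would flag any typos or non-state entries.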
#User counts for each state
state_counts = user_df['STATE'].value_counts()
#Converting state_counts Series to a DataFrame
state_counts_df = state_counts.reset_index()
state_counts_df.columns = ['STATE', 'User Count']
state_counts_df['Percentage'] = (state_counts_df['User Count'] / state_counts_df['User Count'].sum()) * 100
state_counts_df['Percentage'] = state_counts_df['Percentage'].apply(lambda x: f'{x:.2f}%')
#Plotting a Bar chart for User Count vs State
fig = px.bar(state_counts_df, x='STATE', y='User Count',
height=600, width=1300, hover_data={'Percentage': True})
fig.update_layout(
title={
'text': "Users distribution based on States",
'y':0.95,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'},
xaxis_title="States",
yaxis_title="User Count"
)
fig.update_xaxes(tickangle=-40)
fig.show()
#User counts based on Gender
user_df['GENDER'].value_counts()
female                    64240
male                      25829
transgender                1772
prefer_not_to_say          1350
non_binary                  473
unknown                     196
not_listed                  180
Non-Binary                   34
not_specified                28
My gender isn't listed        5
Prefer not to say             1
Name: GENDER, dtype: int64
The GENDER column contains many inconsistent spellings of the same categories, which is a data quality issue. To better understand the gender of Fetch's users, we can group similar values together into a small set of canonical categories.
#Grouping the similar categories into one group and visually representing users based on gender
gender_mapping = {
'female': 'Female',
'male': 'Male',
'transgender': 'Transgender',
'non_binary': 'Non Binary',
'Non-Binary': 'Non Binary',
'prefer_not_to_say': 'Prefer not to Say',
'unknown': 'Unknown',
'not_listed': 'Unknown',
'not_specified': 'Unknown',
'My gender isn\'t listed': 'Unknown',
'Prefer not to say': 'Prefer not to Say'
}
#Creating a dataframe which has User counts for each main gender category
grouped_gender_df = (user_df['GENDER']
                     .map(gender_mapping)
                     .value_counts()
                     .rename_axis('Gender_Group')
                     .reset_index(name='User Count'))
#Creating the pie chart for Gender
fig = px.pie(grouped_gender_df, names='Gender_Group', values='User Count',
height=600, width=900)
fig.update_layout(
title={
'text': "Users Distribution based on Gender",
'y': 0.95,
'x': 0.5,
'xanchor': 'center',
'yanchor': 'top'
}
)
# Show the plot
fig.show()
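An alternative to enumerating every raw spelling in the mapping is to normalize case and punctuation first, so new variants fall through to a default instead of silently becoming NaN. A sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical raw values in the styles seen in the GENDER column
raw = pd.Series(['female', 'Non-Binary', 'non_binary', 'Prefer not to say',
                 'prefer_not_to_say', "My gender isn't listed"])

# Canonical labels keyed on the lowercased, de-punctuated form
canonical = {
    'female': 'Female',
    'male': 'Male',
    'transgender': 'Transgender',
    'non binary': 'Non Binary',
    'prefer not to say': 'Prefer not to Say',
}

normalized = (raw.str.lower()
                 .str.replace(r'[_-]', ' ', regex=True)  # unify separators
                 .map(canonical)
                 .fillna('Unknown'))                     # default bucket
print(normalized.value_counts())
```

With this approach, a future spelling like `NON-BINARY` would still land in the right bucket without touching the mapping.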
#Replacing Blanks with NaN
user_df['LANGUAGE'] = user_df['LANGUAGE'].replace('', np.nan)
print('The total language records that are not Null: ', user_df['LANGUAGE'].value_counts().sum())
#Getting the percentage of Users based on their preferred Language
lang_counts = (user_df['LANGUAGE'].value_counts() / user_df['LANGUAGE'].value_counts().sum()) * 100
print(lang_counts.apply(lambda x: f'{x:.0f}%'))
The total language records that are not Null:  69492
en        91%
es-419     9%
Name: LANGUAGE, dtype: object
We can see above that, of the non-null values, 91% of users have English as their preferred language.
The major data quality issues found in this dataset:
1) Assuming the LANGUAGE column is each user's preferred language, about 30% of its values are missing. This is a major data quality issue, as it will hinder user personalization and targeted marketing.
2) GENDER contains many inconsistent values. To make proper sense of the data and use it in future analysis, the values should be grouped into canonical categories (similar to what was done for the pie chart) to better understand Fetch's consumer base. A small percentage of GENDER values are also missing.
Minor Data Quality issues found in this dataset:
1) A small percentage of values are missing in the STATE and BIRTH_DATE columns. These will have to be handled either by imputing values or removing rows, depending on how the columns are used in the analysis.
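For BIRTH_DATE specifically, parsing with `errors='coerce'` turns both missing and malformed values into `NaT`, so age calculations degrade gracefully instead of raising. A sketch on hypothetical rows in the file's timestamp format (the `' Z'` suffix is matched literally in the format string):

```python
import pandas as pd

# Hypothetical sample rows in the same format as the BIRTH_DATE column
df = pd.DataFrame({'BIRTH_DATE': ['2000-08-11 00:00:00.000 Z',
                                  None,
                                  '1972-03-19 00:00:00.000 Z']})

# Unparseable/missing values become NaT; utc=True localizes the result
birth = pd.to_datetime(df['BIRTH_DATE'],
                       format='%Y-%m-%d %H:%M:%S.%f Z',
                       errors='coerce', utc=True)

# Approximate age against a fixed reference date (NaT propagates as NaN)
ref = pd.Timestamp('2024-01-01', tz='UTC')
age = (ref - birth).dt.days // 365
print(age)
```

Rows with `NaT` can then be filtered or imputed per analysis, rather than failing the whole pipeline.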